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Dual Augmented Lagrangian Method for 
Efficient Sparse Reconstruction 
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Abstract 

We propose an efficient algorithm for sparse signal reconstruction problems. The proposed algorithm 
is an augmented Lagrangian method based on the dual sparse reconstruction problem. It is efficient 
when the number of unknown variables is much larger than the number of observations because of the 
dual formulation. Moreover, the primal variable is explicitly updated and the sparsity in the solution is 
exploited. Numerical comparison with the state-of-the-art algorithms shows that the proposed algorithm 
is favorable when the design matrix is poorly conditioned or dense and very large. 

EDICS category: SAS-STAT, SAS-MALN 

I. Introduction 

Sparse signal reconstruction has recently gained considerable interests in signal/image processing and 
machine learning. Sparsity is often a natural assumption in inverse problems, such as MEG/EEG source 
localization and image/signal deconvolution; sparsity enables us to identify a small number of active 
components even when the dimension is much larger than the number of observations. In addition, a 
sparse model is also valuable in predictive tasks because it can explain why it is able to predict in contrast 
to black-box models such as neural networks and support vector machines. 

In this paper we consider the following particular problem that typically arises in sparse reconstruction: 

(P) minimize -|| Aw — b\\ 2 + A||ttf||i, (1) 
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where w € W 1 is the coefficient vector to be estimated, A £ K mxn is the design matrix, and b € W 11 is 
the vector of observations. It is well known that ^i-norm penalty enforces w to have many zero elements. 
It is called lasso HI in the statistics, basis pursuit denoising O in the signal processing, and FOCUSS Q 
in the brain imaging communities. 

Various methods have been proposed to efficiently solve the optimization problem £[]) (or its generalized 
versions). Iteratively reweighted shrinkage (IRS) is a popular approach for solving the problem £0) (see 
j3l , fill . 151 , 0, Q). The main idea of the IRS approach is to replace a non-differentiable (or non- 
convex) optimization problem by a series of differentiable convex ones; typically the regularizer (e.g., 
|| • ||i in Eq. (Q])) is upper bounded by a weighted quadratic regularizer. Then one can use various existing 
algorithms to minimize the upper bound. The upper bound is re-weighted after every minimization so that 
the solution eventually converges to the solution of the original problem £[)). The challenge in the IRS 
framework is the singularity Q around the coordinate axis. For example, in the l\ problem in Eq. (Q]), 
any zero component Wj = in the initial vector w will remain zero after any number of iterations. 
Moreover, it is possible to create a situation that the convergence becomes arbitrarily slow for finite \wj\ 
because the convergence in the l\ case is only linear Q. Another recent work is the split Bregman 
iteration (SBI) method (H, which is derived from the Bregman iteration algorithm [91] in order to handle 
the noisy (A > 0) case. The Bregman iteration algorithm can be considered as an augmented Lagrangian 
(AL) method (see iflOll . ifTTll . |9l). By introducing an auxiliary variable w, the SBI approach decouples 
the minimization of the first and the second term in Eq. (Q]), which can then be handled independently. 
The two variables w and w are gradually enforced to coincide with each other. Both IRS and SBI require 
solving a linear system of the size of the number of unknown variables (n) repeatedly, which may become 
challenging when n» m. 

Kim et al. |[12l developed an efficient interior-point (IP) method called ll_ls. They proposed a truncated 
Newton method for solving the inner minimization that scales well when the design matrix A is sparse. 

The iterative shrinkage/thresholding (1ST) (see lfT3l . fl4l . Ifl5l . |f9"1 ) is a classic method but it is still an 
area of active research |fl6ll , IfTTll . It alternately computes the steepest descent direction on the loss term 
in Eq. (Q]) and the soft thresholding related to the regularization term. The 1ST method has the advantage 
that every iteration is extremely light (only computes gradient) and every intermediate solution is sparse. 
However the naive version of 1ST is sensitive to the selection of step-size. Recently several authors have 
proposed intelligent step-size selection criteria lfl6ll . ifTTl . 

In this paper we propose an efficient method that scales well when n 3> m, which we call the dual 
augmented Lagrangian (DAL). It is an AL method similarly to SBI method but it is applied to the dual 
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problem; thus the inner minimization is efficient when n m. In addition, in contrast to the "divide 
and conquer" approach of SBI, the inner minimization can be performed jointly over all the variables; it 
converges super linearly because the inner minimization is solved at sufficient precision (see iflOll , ifTTTO . 
Moreover, although the proposed method is based on the dual problem, the primal variable is explicitly 
updated in the computation as the Lagrangian multiplier. DAL computes soft thresholding after every 
iteration similarly to the 1ST approach but with an improved direction as well as an automatic step-size 
selection mechanism; typically the number of outer iterations is less than 10. The proposed approach 
can be applied to large scale problems with dense design matrices because it exploits the sparsity in the 
coefficient vector w in contrast to the IP methods fl2l . which exploits the sparsity in the design matrix. 

This paper is organized as follows. In Sec. [TTJ the DAL algorithm is presented; two approaches for 
the inner minimization problem are discussed. In Sec. [Ill] we experimentally compare DAL to the state- 
of-the-art SpaRSA ifTTl and ll_ls lfl2l algorithms. We give a brief summary and future directions in 
Sec. El 

II. Method 

A. Dual augmented Lagrangian method for sparse reconstruction 

Let f(w) be the objective in Eq. £[]). The challenge in minimizing f(w) arises from its non-differentiability. 
The proposed approach is based on the minimization of a differentiable surrogate function f v (w). In this 
section we derive the surrogate function f n {w) and its gradient from the augmented Lagrangian function 
I/,, of the dual problem of Eq. £T|). 

Using the Fenchel duality (see lfT8l Sec. 5.4]) and a splitting similar to SBI (in the dual), we obtain 
the following dual problem of problem dT) (see also ifTBl ): 

(D) maximize - ~||a - 6||| + - (2) 

DSR^aelR" 1 z z 

subject to v = A T a, (3) 

where S^(v) is the indicator function lfl31 of the £00 ball of radius A, i.e., S^(v) = (if ||tt||oo < an d 
+00 (otherwise). It can be shown that the strong duality holds, i.e., the maximum of Eq. Q d(a*,v*) 
coincides with the minimum of Eq. (OQ) f(w*), where d is the objective function in Eq. (0; w* and 
(a*, v*) are the minimizer and the maximizer of the primal and dual problems, respectively. 
The augmented Lagrangian (AL) function of the dual problem (Eqs. Q and ([3])) is defined as follows: 

L v (a,v; w) = -ha - bf 2 + i||6||! - 6?{v) - w T (a t cx - v) - |||A T a - v\\l (4) 
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where w is the Lagrangian multiplier associated with the equality constraint (Eq. (0) and corresponds 
to the coefficient vector in the primal problem. The last term in Eq. (@]) is called the barrier term and 
rj(> 0) is called the barrier parameter. When 77 = 0, the AL function is reduced to the ordinary Lagrangian 
function. See liTOll . ifTTl for the details of the AL method. See also [ 19 ] for the ordinary Lagrangian duality. 
Now we define the surrogate function f v {w) as follows: 

f„{w)= max L„(ol,v;w). (5) 

Note that from the strong duality fo(w) = max Qi „ Lq(ol, v; w) = f(w). In addition, since Lq(cx, v; w) > 
L v {a. : v; w), the inequality f(w ) > f v (w ) holds. Moreover, since f v (w ) > L^a.* , v*; w) = d(a*, v*) = 
f(w*) (we use A T a* = v* to obtain the first equality), we have mm w& ^ fn( w ) = f( w *) f° r an Y 
nonnegative i]. Furthermore, f- n (w) is differentiable if rj > 0. 

The maximization with respect to v in Eq. §5§ can be computed analytically and v can be eliminated 
from Eq. ((U), as follows: 



1 / 71 

max L v (a,v-,w) = --\\a - b\\ 2 2 - min (S^(v) + - 



A T a — w/t] ) + c(w, rj) 



~\\a - bg -iWA^ + w/ri- If (A 1 a + w/rj)^ + c(w, r,) 
\\\<x ~ Hi - |l|ST A (A T a + to/r/) ||1 + c{w, r,) =: L„(a; w), 



where c(w, if) is a constant that only depends on w and 77, and is a projection on the ball of 
radius A; note that r]P^°(w) = P°^(rjw); in addition, we define the well known soft thresholding function 
SJ X (see HI, HI, HI, 10) as follows: 



SJ x (w) =w-PZ°(w) = { 



— A if wj > A, 
if -\<Wj<\, (i = 1, ...,n). 

Wj + A if wj < —A, 

Typically in an AL method the barrier parameter 77 is increased as 771 < 772 < • • •; this guarantees super 
linear convergence of the method (see ifTOl ). The coefficient vector w is updated using the gradient of 
f v (w) as follows: 

w k+1 = w k + r] k (A T Ck k - v k ) = w k + r] k (A T a k - P^{A T a k + w k /r] k )) = STx Vk (w k + r] k A T a k ), 

because V w f v (w k ) = (V w a k )V a L v (a k ;w k ) + V w L v (a k ;w k ) = -(A T a k - v k ), where a k and 
v k are the maximizer of Eq. ® at the current w k and V a L ri (a k ;w k ) = because a k maximizes 
L v (a;w k ). We can also show that f{w k ) > f v (w k ) > f(w k+ i) with strict inequality except the 
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Choose sequences r)\ < r\i < . . . and ei > £2 > Let Wi be the initial primal vector. Let k = 1. 

while Stopping criterion is not satisfied do 

Let ot k be an (approximate) minimizer of g{a) := —L Vk (a;w k ) with tolerance e k as follows: 

a. k -argminf -\\a - b\\j + ^ \\ST X (A T a + w k /r] k )\\ 2 ) , (6) 

where || V a g(a. k )\\2 < £fc and V a g(a k ) is the gradient of the above Eq. © at a k (see Eq. (|7]l). 
Update the primal coefficient vector w k as: 

w k+ i = ST\ m (w k + ri k A T ot k ) 

fc + 1. 
end while 

Fig. 1. Dual augmented Lagrangian method for sparse signal reconstruction (see Eqs. l[T) and l[2}-) 

minimum of Eq. (Q} (TUl Chap. 5]. Accordingly the dual augmented Lagrangian method can be described 
as in Fig. Q] 

B. Inner minimization 

Let g(a) be the objective function in Eq. ([6]); g(a) is once differentiable everywhere and also twice 
differentiable except the points on which the above soft thresholding function switches. We use the 
Newton method for the minimization of g(a). The gradient and the Hessian of the objective function 
g(a) can be written as follows: 

V a g((x) = a - b + Vk ASJ x (q), (7) 
V 2 a g(a) = I m + r 1k A + A + T , (8) 

where q = A 1 a + w k /r] k , I m is the identity matrix of size m, and A + is the submatrix of A that 
consists of "active" columns with indices J + = {j € {1,2, ... ,n} : \qj\ > A}. Note that in both the 
computation of the gradient and the Hessian, computational complexity is only proportional to the number 
of active components of q. The discontinuity of the second derivative is in general not a problem. In fact, 
we can see from the complementary slackness condition that for finite rj the optimal solution-multiplier 
pair (w*, a*) is on a regular point; thus the convergence around the minimum is quadratic. 

We propose two approaches for solving the Newton system V 2 a g(a.)y = — X7 a g(a). The first ap- 
proach (DALchol) uses the Cholesky factorization of the Hessian matrix V 2 a g{a). The second approach 
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(DALcg) uses a preconditioned conjugate gradient method (the truncated Newton method in lfT2l ) with 
a preconditioner that only consists of the diagonal elements of the Hessian matrix. Finally the standard 
backtracking line-search with initial step-size 1 is applied to guarantee decrease in the objective g(ct). 

III. Empirical comparisons 

We test the computational efficiency of the proposed DAL algorithm on the £2-^1 problem (Eq. CO) 
under various conditions. The DAL algorithm is compared to two state-of-the-art algorithms, namely 
ll_ls (interior-point algorithm, lfl2l ) and SpaRSA (step-size improved 1ST, ifTTl '). 

A. Experimental settings 



In the first experiment (Fig. |2(a)| ), the elements of the design matrix A are sampled from the indepen- 
dent zero-mean Gaussian distribution with variance l/(2n). This choice of variance makes the largest 
singularvalue of A approximately one ifiTl . The true coefficient vector wq is generated by randomly filling 
4% of its elements by +1 or —1 which is also randomly chosen. The remaining elements are zero. The 
target vector b is generated as b = Awq+£, where £ is sampled from the zero-mean Gaussian distribution 
with variance 1CP 4 . The number of observations (m) is increased from m = 128 to m = 8, 196 while 
the number of variables (n) is increased proportionally as n = 4m. The regularization constant A is kept 
constant at 0.025, which is found to approximately correspond to the choice A = 0.1||A T 6|| oo in flTTl 1 . 
In the second experiment (Fig. |2(b)| ), the setting is almost the same except that the singular values of 
A is replaced by a series decreasing as 1/s for the s-th singular value. Thus the condition number (the 
ratio between the smallest and the largest singular values) of A is m. Additionally we set the variance of 
£ to zero (no noise) and keep A constant at 0.0003, which is also found to approximately correspond to 
the setting in IfTTl In the last experiment (Fig. [3]), the number of observations (m) is kept at m = 1, 024 
and the number of samples (n) is increased from n = 4,096 to n = 1,048, 576. The design matrix A 
and the target vector b are generated as in the first experiment. In addition, the regularization constant 
A is decreased as A = 1.6/n 1 / 2 , which equals 0.025 at n = 4,096 and is again chosen to approximately 
match the setting in IfTTl . In each figure, we show the computation time, the number of steps, and the 
sparsity of the solution (the proportion of non-zero elements in the final solution) from top to bottom. 
All the results are averaged over 10 random initial coefficient vectors w. All the experiments are run 

'A fair comparison at a smaller regularization constant would require continuation techniques, which should be addressed in 
a separate paper. 
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on MATLAB 7.7 (R2008b) on a workstation with two 3.0GHz quad-core Xeon processors and 16GB of 
memory. 

B. Practical issues 

1) Stopping criterion: We use the "duality" stopping criterion proposed in ifTTll for all the results 
presented in the next section. More precisely, we generate a dual variable a as follows, 

a = Xa/\\A T a\\ oc , 

where a = Aw — b is the gradient of the primal loss term in Eq. (Q]). The above defined a is a feasible 
point of the dual problem (Eq. (O) by definition, i.e., ||A T a|| 00 < A. Thus we use the primal-dual pair 
(w, a) to measure the relative duality gap (/(«?) — d(a, A T a))/ f(w), where / and d are the objective 
functions in the primal problem (Eq. ([TJ) and the dual problem (Eq. (O), respectively. The tolerance 
10 -3 is used. 

2) Hyperparameters: The tolerance parameter for the inner minimization is chosen as follows. 
We use ei = 10~ 4 • m 1//2 and decrease as = £fc_i/2. Using larger results in cheaper inner 
minimization but often requires a larger number of outer iterations. The barrier parameter rj^ also affects 
the behavior of the algorithm. Typically larger rjh gives larger reduction in the duality gap at every 
iteration but makes the inner minimization more difficult. Additionally the best value of rjk depends on 
the size of the problem, regularization constant A, and the spectrum of A. We manually choose r]i for 
each problem in the next section and increase % as rjk = ^Vk-i, which guarantees the super-linear 
convergence ifTOl . 

C. Results 



When the data is well conditioned (Fig. 2(a)| ), SpaRSA performs clearly the best within the three 



algorithms. The proposed DAL algorithm with the conjugate gradient (DALcg) performs comparable to 
ll_ls. The proposed DAL with the Cholesky factorization (DALchol) is less efficient than DALcg when 
m is large because the complexity grows as 0(m 3 ); note however that the cost for building the Hessian 
matrix is only 0(m 2 n + ), where n + is the number of active components (see Eq. ©). 

In contrast, when the data is poorly conditioned (Fig. |2(b)| ), the proposed DALcg runs almost 100 
times faster than SpaRSA at most. This can be clearly seen in the number of steps (the middle row). 
Although the numbers of steps DAL and ll_ls require are almost constant from Fig. |2(a)| to Fig. |2(b) 



that of SpaRSA is increased at least by the factor 10. Note that the sparsity of the solution is decreasing 
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Number of observations Number of observations 



(a) Normal conditioning (b) Poor conditioning 

Fig. 2. Comparison of running time and number of steps for three optimization algorithms, namely DAL, SpaRSA and 11 Is 

for problems of various sizes with (a) design matrix A generated from independent normal random variables and (b) the same 
matrix with singular values replaced by a power-law distribution. The horizontal axis denotes the number of observations (m). 
The number of variables is n — 4m. The regularization constant A is fixed at A = 0.025 in (a) and A = 0.0003 in (b). 



as the number of samples increases. This may explain why the proposed DAL algorithm is more robust 

to poor conditioning than 1 1 Is because 11 Is does not exploit the sparsity in the solution. 

Finally we compare the three algorithms for very large problems in Fig. [3j Clearly the proposed DAL 

has milder scaling to the dimensionality than both SpaRSA and 11 Is. This is because the proposed DAL 

algorithm is based on the dual problem (Eq. (f2])). The computational efficiency of DALchol and DALcg 
is comparable because m is kept constant in this experiment. The initial barrier parameter 771 = 100000 
seems to perform better than t]i = 1000 for large n. 

IV. Conclusion 

In this paper we have proposed a new optimization framework for sparse signal reconstruction, which 
converges super-linearly. It is based on the dual sparse reconstruction problem. The sparsity of the 
coefficient vector w is explicitly used in the algorithm. Numerical comparisons have shown that the 
proposed DAL algorithm is favorable against a state-of-the-art algorithm SpaRSA when the design matrix 
A is poorly conditioned or m « n. In fact, it has solved problems with millions of variables in less than 
20 minutes even when the design matrix A is dense. In addition, for dense A, DAL has shown improved 
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Fig. 3. Comparison of the algorithms for large scale problems when the number of variable (n) is much larger than the number 
of observations (m). m is kept constant at m — 1024. A is decreased as A = 1.6/n 1 . 



efficiency to 11 Is in most cases. Future work includes generalization of DAL to other loss functions and 

sparsity measures, continuation strategy, and approximate minimization of the inner problem. 
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